Penguins

Tidy Tuesday: Penguins

This week has several objectives. First, last week we talked briefly about YAML files, but I had never intentionally manipulated this metadata before. Second, after watching this talk by Desirée De Leon, I wanted to try some of the principles that were used to create Teacups, giraffes and statistics:

  1. good characters
  2. good play
  3. good design

Third, I wanted to further explore the TidyTuesday dataset on Penguins using the GGally package.


Let’s get started!

  • We have the good characters covered: the art is provided by Dr. Allison Horst, the data were collected and made available by Dr. Kristen Gorman, and both come in a nice package that was developed with Dr. Alison Hill: palmerpenguins. Think of iris, but with penguins.

  • For good play, I will incorporate an interactive component into this R Markdown document. Aside from using plotly for visualization, we could use the learnr package, which presents information in a tutorial format (e.g. equations, videos, code exercises, quizzes, shiny components).

  • For good design, I’ve incorporated div tips to make the document stand out a bit. Here’s a link on how to make them.

There are a few options incorporated into the YAML configuration. The header was imported from an HTML file that points to a local image file of Iter penguins.
Available highlighting styles for code chunks can be listed by running the following in the terminal: pandoc --list-highlight-styles. For this file, I used zenburn. The document theme can also be changed from the default, either with one of the pre-packaged themes or by installing R packages that provide additional themes (check out this blogpost). In this document I used the simplex theme.
For all the div tips used here, I took the colors from the Iter penguins artwork using ColorSlurp and imported the Google font Indie Flower in the CSS style file. The images within the div tips are courtesy of Desirée De Leon.
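
As a rough illustration of the options described above, here is a minimal sketch that builds a comparable YAML header from R with the yaml package and prints it. The file names style.css and header.html are hypothetical placeholders, and the use of before_body for the HTML header is an assumption (the post does not say which includes option was used). The last lines list pandoc's highlighting styles from within R, provided pandoc is on the PATH.

```r
# Sketch only: file names and the before_body choice are assumptions.
library(yaml)

front_matter <- list(
  title  = "Tidy Tuesday: Penguins",
  output = list(
    html_document = list(
      theme     = "simplex",    # document theme named in the post
      highlight = "zenburn",    # code highlighting style named in the post
      css       = "style.css",  # hypothetical stylesheet (div tips, Indie Flower font)
      includes  = list(before_body = "header.html")  # hypothetical HTML header file
    )
  )
)

# Print the header as it would sit between the --- fences of the .Rmd file
yaml_text <- as.yaml(front_matter)
cat("---\n", yaml_text, "---\n", sep = "")

# List the highlighting styles pandoc knows about, if pandoc is available
if (nzchar(Sys.which("pandoc"))) {
  styles <- system2("pandoc", "--list-highlight-styles", stdout = TRUE)
  print(styles)
}
```

Copying the printed block to the top of an .Rmd file should be enough to try a different theme or highlight value.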

Load libraries

We will use plotly for interactive plots, skimr for a quick data summary, and GGally for scatterplot matrix correlograms.

suppressPackageStartupMessages(library(tidyverse))
library(plotly)
library(skimr)
library(GGally)

Get Data

penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
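
If the download is not an option, the same data also ship with the palmerpenguins package mentioned earlier. The sketch below assumes that package is installed; penguins_alt is a name made up for this example. The package stores species, island and sex as factors, so they are converted to character to resemble what read_csv() returns.

```r
library(dplyr)

# Assumption: palmerpenguins is installed, e.g. install.packages("palmerpenguins")
penguins_pkg <- palmerpenguins::penguins

# Convert the factor columns to character, as read_csv() would return them
penguins_alt <- penguins_pkg %>%
  mutate(across(where(is.factor), as.character))

glimpse(penguins_alt)
```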

Inspect Data

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
species                0           1.00    6    9      0         3           0
island                 0           1.00    5    9      0         3           0
sex                   11           0.97    4    6      0         2           0

Variable type: numeric

skim_variable      n_missing  complete_rate     mean      sd      p0      p25      p50     p75    p100  hist
bill_length_mm             2           0.99    43.92    5.46    32.1    39.23    44.45    48.5    59.6  ▃▇▇▆▁
bill_depth_mm              2           0.99    17.15    1.97    13.1    15.60    17.30    18.7    21.5  ▅▅▇▇▂
flipper_length_mm          2           0.99   200.92   14.06   172.0   190.00   197.00   213.0   231.0  ▂▇▃▅▂
body_mass_g                2           0.99  4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0  ▃▇▆▃▂
year                       0           1.00  2008.03    0.82  2007.0  2007.00  2008.00  2009.0  2009.0  ▇▁▇▁▇
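
The summary above pools all three species. A possible next step, sketched here under the assumption that the palmerpenguins package is installed (it stands in for the CSV so the snippet runs on its own), is to group by species before calling skim(). The object name penguins_by_species and the choice of two columns are only for illustration.

```r
library(dplyr)
library(skimr)

# Assumption: palmerpenguins is installed; the data come from that package here
penguins_by_species <- palmerpenguins::penguins %>%
  select(species, bill_length_mm, body_mass_g) %>%  # keep the summary short
  group_by(species) %>%
  skim()                                            # one summary row per column and species

penguins_by_species
```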

[Image: bill]

Data Wrangling

The downloaded data is pretty clean. I only dropped the rows with missing values (most of them in the sex variable) and converted year to a factor.

penguins_df <- penguins %>%
  drop_na() %>%                    # drop rows with any missing value
  mutate(year = as.factor(year))   # treat year as a categorical variable
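
To see what this wrangling step removes, the following sketch counts missing values per column, applies the same two steps, and tallies what is left. It assumes the palmerpenguins package is installed and uses it in place of the CSV so the snippet runs on its own; penguins_raw, penguins_clean, na_counts and tally are names made up for this example.

```r
library(dplyr)
library(tidyr)

# Assumption: palmerpenguins is installed and stands in for the CSV read above
penguins_raw <- palmerpenguins::penguins

# Count missing values per column before dropping anything
na_counts <- penguins_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)

# Same wrangling as in the post: keep complete rows, make year a factor
penguins_clean <- penguins_raw %>%
  drop_na() %>%
  mutate(year = as.factor(year))

# Tally the remaining observations by species, island and sex
tally <- penguins_clean %>%
  count(species, island, sex)
print(tally)
```

Note that drop_na() without arguments removes a row if any column is missing, not only sex.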

Visualization

ggpairs(penguins_df)
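
The full matrix can get crowded with eight columns. One option, sketched below, is to restrict ggpairs() to the four body measurements and color by species through its columns and mapping arguments. The snippet assumes the palmerpenguins package is installed so it can run on its own; within this post, penguins_df could be passed instead.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)

# Assumption: palmerpenguins is installed and stands in for penguins_df
penguins_clean <- palmerpenguins::penguins %>% drop_na()

pairs_plot <- ggpairs(
  penguins_clean,
  # only the numeric body measurements
  columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
  # color every panel by species
  mapping = aes(colour = species)
)

pairs_plot
```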

Interactive

Click on a legend label to hide that group in the plot!

p<- ggplot(penguins_df, aes(flipper_length_mm, bill_length_mm, fill= species, color=species)) +
  geom_point() +
  geom_smooth(method='lm', formula= y~x) +
  hrbrthemes::theme_ipsum() +
  scale_fill_manual(values = c("#FF8000", "#C85BCA", "#0E7274")) +
  scale_color_manual(values = c("#FF8000", "#C85BCA", "#0E7274"))

ggplotly(p, height = 800, width = 800)
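
Because color and fill are mapped to species, geom_smooth(method='lm') draws one regression line per species. The sketch below fits the same per-species linear models directly to read off the slopes. It assumes the palmerpenguins package is installed as a stand-in for penguins_df, and slopes is a name made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumption: palmerpenguins is installed and stands in for penguins_df
penguins_clean <- palmerpenguins::penguins %>% drop_na()

# Fit bill_length_mm ~ flipper_length_mm separately for each species
# and keep only the slope of each fit
slopes <- penguins_clean %>%
  split(.$species) %>%
  map_dbl(~ coef(lm(bill_length_mm ~ flipper_length_mm, data = .x))[["flipper_length_mm"]])

slopes
```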

p2 <- ggplot(penguins_df, aes(body_mass_g, island, color = species)) +
  geom_point() +
  facet_grid(~sex)+
  hrbrthemes::theme_ipsum() +
  scale_color_manual(values = c("#FF8000", "#C85BCA", "#0E7274"))

ggplotly(p2, height = 800, width = 800)
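
The hover labels can also be customized. A common pattern, sketched here, is to map a text aesthetic in ggplot() and then tell ggplotly() which aesthetics to show through its tooltip argument. The snippet assumes the palmerpenguins package is installed as a stand-in for penguins_df; p_tip and fig are names made up for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)

# Assumption: palmerpenguins is installed and stands in for penguins_df
penguins_clean <- palmerpenguins::penguins %>% drop_na()

p_tip <- ggplot(
  penguins_clean,
  aes(flipper_length_mm, bill_length_mm, color = species,
      # extra hover text; <br> starts a new line in the plotly tooltip
      text = paste0("Island: ", island, "<br>Sex: ", sex))
) +
  geom_point() +
  scale_color_manual(values = c("#FF8000", "#C85BCA", "#0E7274"))

# Show only x, y and the custom text when hovering over a point
fig <- ggplotly(p_tip, tooltip = c("x", "y", "text"))
fig
```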